Analysis of Airbnb NYC

Group Members:

Aakash Shetty

Pratik Patil

Saket Tulsan

Saiprasad Bahulekar

Vaibhavi Mulay

Airbnb is a paid community platform for renting and booking private accommodation, founded in 2008. Airbnb allows individuals to rent out all or part of their own home as extra accommodation. The site offers a search and booking platform connecting people offering their accommodation with vacationers who wish to rent it. It covers more than 1.5 million listings in more than 34,000 cities and 191 countries. Between its founding in August 2008 and June 2012, more than 10 million nights were booked on Airbnb.

Since 2008, guests and hosts have used Airbnb to expand traveling possibilities and experience the world in a more unique, personalized way. Today, Airbnb has become a one-of-a-kind service used and recognized around the world. Data analysis of the millions of listings provided through Airbnb is a crucial factor for the company. These listings generate a lot of data - data that can be analyzed and used for security, business decisions, understanding customer and host behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

Problem Statement

Find the best prediction model for price, i.e., the relationship between price and the other listing factors.

Audience

Travelers and Hosts using Airbnb

Dataset

The dataset we have used here is the New York City Airbnb Open Data, available on Kaggle. It has 16 columns and 48,895 rows.

Below you will find the implementation of a few processes we have done for analysis. You can jump to the sections:

1. Data Cleaning
2. Exploratory Data Analysis
3. Statistics and Machine Learning

Data Setup

First we import the libraries we need, such as NumPy, pandas, and Matplotlib, to manipulate, analyze, and visualize our data. Next we set up our data by importing the dataset from a CSV file into the notebook, where it is read into a pandas DataFrame.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
import seaborn as sns
import pandas_profiling

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
In [50]:
#using the pandas 'read_csv' function to read the csv file, which is already formatted for us by Kaggle
airbnb=pd.read_csv('AB_NYC_2019.csv')
#examining the head of the csv file
airbnb.head(10)
Out[50]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.65 -73.97 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75 -73.98 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.81 -73.94 Private room 150 3 0 NaN nan 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.69 -73.96 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.80 -73.94 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.75 -73.97 Entire home/apt 200 3 74 2019-06-22 0.59 1 129
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.69 -73.96 Private room 60 45 49 2017-10-05 0.40 1 0
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76 -73.98 Private room 79 2 430 2019-06-24 3.47 1 220
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80 -73.97 Private room 79 2 118 2017-07-21 0.99 1 0
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71 -73.99 Entire home/apt 150 1 160 2019-06-09 1.33 4 188
In [3]:
#profiling helps understanding the distribution of data
pandas_profiling.ProfileReport(airbnb)
Out[3]:

Data Cleaning

The first step is to clean our data. Here we will perform operations such as getting the data into a standard format, handling null values, and removing unnecessary columns or values.

In [4]:
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id                                48895 non-null int64
name                              48879 non-null object
host_id                           48895 non-null int64
host_name                         48874 non-null object
neighbourhood_group               48895 non-null object
neighbourhood                     48895 non-null object
latitude                          48895 non-null float64
longitude                         48895 non-null float64
room_type                         48895 non-null object
price                             48895 non-null int64
minimum_nights                    48895 non-null int64
number_of_reviews                 48895 non-null int64
last_review                       38843 non-null object
reviews_per_month                 38843 non-null float64
calculated_host_listings_count    48895 non-null int64
availability_365                  48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
In [5]:
total = airbnb.isnull().sum().sort_values(ascending=False)
percent = ((airbnb.isnull().sum())*100)/airbnb.isnull().count().sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)
Out[5]:
Total Percent
reviews_per_month 10052 20.558339
last_review 10052 20.558339
host_name 21 0.042949
name 16 0.032723
availability_365 0 0.000000
calculated_host_listings_count 0 0.000000
number_of_reviews 0 0.000000
minimum_nights 0 0.000000
price 0 0.000000
room_type 0 0.000000
longitude 0 0.000000
latitude 0 0.000000
neighbourhood 0 0.000000
neighbourhood_group 0 0.000000
host_id 0 0.000000
id 0 0.000000
In [54]:
airbnb['adjusted_price'] = airbnb.price/airbnb.minimum_nights

airbnb.head()
Out[54]:
name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 Clean & quiet apt home by the park 2787 Brooklyn Kensington 40.65 -73.97 Private room 149 1 9 2018-10-19 0.21 6 365 149.00
1 Skylit Midtown Castle 2845 Manhattan Midtown 40.75 -73.98 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 225.00
2 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Manhattan Harlem 40.81 -73.94 Private room 150 3 0 NaN nan 1 365 50.00
3 Cozy Entire Floor of Brownstone 4869 Brooklyn Clinton Hill 40.69 -73.96 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 89.00
4 Entire Apt: Spacious Studio/Loft by central park 7192 Manhattan East Harlem 40.80 -73.94 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 8.00
In [7]:
airbnb["last_review"] = pd.to_datetime(airbnb.last_review)

airbnb.head()
Out[7]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaT NaN 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 8.0
In [8]:
airbnb["reviews_per_month"] = airbnb["reviews_per_month"].fillna(airbnb["reviews_per_month"].mean())
airbnb.head()
Out[8]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.210000 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.380000 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaT 1.373221 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.640000 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.100000 1 0 8.0
In [9]:
airbnb.last_review.fillna(method="ffill", inplace=True)

airbnb.head()
Out[9]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.210000 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.380000 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 2019-05-21 1.373221 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.640000 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.100000 1 0 8.0
In [10]:
for column in airbnb.columns:
    if airbnb[column].isnull().sum() != 0:
        print("=======================================================")
        print(f"{column} ==> Missing Values : {airbnb[column].isnull().sum()}, dtypes : {airbnb[column].dtypes}")
        
for column in airbnb.columns:
    if airbnb[column].isnull().sum() != 0:
        airbnb[column] = airbnb[column].fillna(airbnb[column].mode()[0])
        
airbnb.isnull().sum()
=======================================================
name ==> Missing Values : 16, dtypes : object
=======================================================
host_name ==> Missing Values : 21, dtypes : object
Out[10]:
id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
adjusted_price                    0
dtype: int64
In [11]:
pd.options.display.float_format = "{:.2f}".format
airbnb.describe()
Out[11]:
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 adjusted_price
count 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00
mean 19017143.24 67620010.65 40.73 -73.95 152.72 7.03 23.27 1.37 7.14 112.78 70.17
std 10983108.39 78610967.03 0.05 0.05 240.15 20.51 44.55 1.50 32.95 131.62 157.62
min 2539.00 2438.00 40.50 -74.24 0.00 1.00 0.00 0.01 1.00 0.00 0.00
25% 9471945.00 7822033.00 40.69 -73.98 69.00 1.00 1.00 0.28 1.00 0.00 20.00
50% 19677284.00 30793816.00 40.72 -73.96 106.00 3.00 5.00 1.22 1.00 45.00 44.50
75% 29152178.50 107434423.00 40.76 -73.94 175.00 5.00 24.00 1.58 2.00 227.00 81.50
max 36487245.00 274321313.00 40.91 -73.71 10000.00 1250.00 629.00 58.50 327.00 365.00 8000.00
In [52]:
# Drop ["id", "host_name"]: "id" is insignificant for modelling, and "host_name" is dropped for ethical (privacy) reasons.
airbnb.drop(["id", "host_name"], axis="columns", inplace=True)
airbnb.head()
Out[52]:
name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 Clean & quiet apt home by the park 2787 Brooklyn Kensington 40.65 -73.97 Private room 149 1 9 2018-10-19 0.21 6 365
1 Skylit Midtown Castle 2845 Manhattan Midtown 40.75 -73.98 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Manhattan Harlem 40.81 -73.94 Private room 150 3 0 NaN nan 1 365
3 Cozy Entire Floor of Brownstone 4869 Brooklyn Clinton Hill 40.69 -73.96 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 Entire Apt: Spacious Studio/Loft by central park 7192 Manhattan East Harlem 40.80 -73.94 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
In [13]:
categorical_col = []
for column in airbnb.columns:
    if len(airbnb[column].unique()) <= 10:
        print("===============================================================================")
        print(f"{column} : {airbnb[column].unique()}")
        categorical_col.append(column)
===============================================================================
neighbourhood_group : ['Brooklyn' 'Manhattan' 'Queens' 'Staten Island' 'Bronx']
===============================================================================
room_type : ['Private room' 'Entire home/apt' 'Shared room']

Exploratory Data Analysis

Exploratory Data Analysis, or EDA, is an approach to analyzing a dataset to summarize its characteristics, often with visual methods. For the given dataset we have explored the attributes using appropriate graphical models. This helps us understand the nature of our data, its behavior, and so on. In the sections below we analyze the data to answer questions like why, where, and how various factors affect Airbnb ratings and prices.

In [14]:
import plotly.graph_objs as go

#Access token from Plotly
mapbox_access_token = 'pk.eyJ1Ijoia3Jwb3BraW4iLCJhIjoiY2pzcXN1eDBuMGZrNjQ5cnp1bzViZWJidiJ9.ReBalb28P1FCTWhmYBnCtA'

#Prepare data for Plotly
data = [
    go.Scattermapbox(
        lat=airbnb.latitude,
        lon=airbnb.longitude,
        mode='markers',
        text=airbnb[['neighbourhood_group','number_of_reviews','adjusted_price']],
        marker=dict(
            size=7,
            color=airbnb.adjusted_price,
            colorscale='RdBu',
            reversescale=True,
            colorbar=dict(
                title='Adjusted Price'
            )
        ),
    )
]
In [15]:
#Prepare layout for Plotly
layout = go.Layout(
    autosize=True,
    hovermode='closest',
    title='NYC Airbnb ',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.721319,
            lon=-73.987130
        ),
        pitch=0,
        zoom=11
    ),
)
In [16]:
from plotly.offline import init_notebook_mode, iplot
#Create map using Plotly
fig = dict(data=data, layout=layout)
iplot(fig, filename='NYC Airbnb')
In [17]:
airbnb[airbnb.adjusted_price > 5000]
Out[17]:
name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
3720 SuperBowl Penthouse Loft 3,000 sqft 1483320 Manhattan Little Italy 40.72 -74.00 Entire home/apt 5250 1 0 2016-03-13 1.37 1 0 5250.00
3774 SUPER BOWL Brooklyn Duplex Apt!! 11598359 Brooklyn Clinton Hill 40.69 -73.96 Entire home/apt 6500 1 0 2018-07-14 1.37 1 0 6500.00
4377 Film Location 1177497 Brooklyn Clinton Hill 40.69 -73.97 Entire home/apt 8000 1 1 2016-09-15 0.03 11 365 8000.00
15560 Luxury townhouse Greenwich Village 66240032 Manhattan Greenwich Village 40.73 -74.00 Entire home/apt 6000 1 0 2017-12-30 1.37 1 0 6000.00
29662 East 72nd Townhouse by (Hidden by Airbnb) 156158778 Manhattan Upper East Side 40.77 -73.96 Entire home/apt 7703 1 0 2018-09-21 1.37 12 146 7703.00
29664 Park Avenue Mansion by (Hidden by Airbnb) 156158778 Manhattan Upper East Side 40.79 -73.95 Entire home/apt 6419 1 0 2018-09-21 1.37 12 45 6419.00
42523 70' Luxury MotorYacht on the Hudson 7407743 Manhattan Battery Park City 40.71 -74.02 Entire home/apt 7500 1 0 2019-05-31 1.37 1 364 7500.00
44034 3000 sq ft daylight photo studio 3750764 Manhattan Chelsea 40.75 -74.00 Entire home/apt 6800 1 0 2019-06-15 1.37 6 364 6800.00
45666 Gem of east Flatbush 262534951 Brooklyn East Flatbush 40.66 -73.92 Private room 7500 1 8 2019-07-07 6.15 2 179 7500.00
In [18]:
import plotly.express as px

## Setting up the Visualization..
fig = px.scatter_mapbox(airbnb, 
                        hover_data = ['price','minimum_nights','room_type'],
                        hover_name = 'neighbourhood',
                        lat="latitude", 
                        lon="longitude", 
                        color="neighbourhood_group", 
                        size="price",
#                         color_continuous_scale=px.colors.cyclical.IceFire, 
                        size_max=30, 
                        opacity = .70,
                        zoom=10,
                       )
# "open-street-map", "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner" and
# "stamen-watercolor" yield maps composed of raster tiles from various public tile servers which do
# not require signups or access tokens
# fig.update_layout(mapbox_style="carto-positron", 
#                  )
fig.layout.mapbox.style = 'stamen-terrain'
fig.update_layout(title_text = 'Airbnb by Borough in NYC<br>(Click legend to toggle borough)', height = 800)
fig.show()

The first graph shows the relationship between price and room type. Shared room prices are always below 2,000 dollars, while private rooms and entire homes reach the highest prices.

In [19]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,12))
sns.scatterplot(x='room_type', y='price', data=airbnb)

plt.xlabel("Room Type", size=13)
plt.ylabel("Price", size=13)
plt.title("Room Type vs Price",size=15, weight='bold')
Out[19]:
Text(0.5, 1.0, 'Room Type vs Price')

The graph below shows price by room type, broken down by neighbourhood group. The highest prices for both Private Room and Entire Home/Apt are in the same area, Manhattan, and Brooklyn also has very high prices for both. The highest Shared Room prices, on the other hand, are in Queens and Staten Island.

In [20]:
plt.figure(figsize=(20,15))
sns.scatterplot(x="room_type", y="price",
            hue="neighbourhood_group", size="neighbourhood_group",
            sizes=(50, 200), palette="Dark2", data=airbnb)

plt.xlabel("Room Type", size=13)
plt.ylabel("Price", size=13)
plt.title("Room Type vs Price vs Neighbourhood Group",size=15, weight='bold')
Out[20]:
Text(0.5, 1.0, 'Room Type vs Price vs Neighbourhood Group')
In [21]:
f,ax=plt.subplots(1,2,figsize=(18,8))
airbnb['neighbourhood_group'].value_counts().plot.pie(explode=[0,0.05,0,0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Share of Neighborhood')
ax[0].set_ylabel('Neighborhood Share')
sns.countplot('neighbourhood_group',data=airbnb,ax=ax[1],order=airbnb['neighbourhood_group'].value_counts().index)
ax[1].set_title('Share of Neighborhood')
plt.show()
In [55]:
plt.figure(figsize=(10,6))
sns.distplot(airbnb[airbnb.neighbourhood_group=='Manhattan'].adjusted_price,color='maroon',hist=False,label='Manhattan')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Brooklyn'].adjusted_price,color='black',hist=False,label='Brooklyn')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Queens'].adjusted_price,color='green',hist=False,label='Queens')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Staten Island'].adjusted_price,color='blue',hist=False,label='Staten Island')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Bronx'].adjusted_price,color='lavender',hist=False,label='Bronx')
plt.title('Borough wise price probability distribution for adjusted_price<1000')
plt.xlim(0,1000)
plt.show()
In [59]:
#we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization

#creating a sub-dataframe with no extreme values / adjusted price less than 200
sub_6=airbnb[airbnb.adjusted_price < 200]
#using a violinplot to showcase the density and distribution of prices
viz_2=sns.violinplot(data=sub_6, x='neighbourhood_group', y='adjusted_price')
viz_2.set_title('Density and distribution of prices for each neighbourhood_group')
Out[59]:
Text(0.5, 1.0, 'Density and distribution of prices for each neighbourhood_group')

Great, with the statistical table and the violin plot we can observe a few things about the distribution of Airbnb prices across NYC boroughs. First, Manhattan has the highest range of prices for its listings, with an average observation around 150 dollars per night, followed by Brooklyn at about 90 dollars per night. Queens and Staten Island appear to have very similar distributions, and the Bronx is the cheapest of them all. This distribution and density of prices is expected: it is no secret that Manhattan is one of the most expensive places in the world to live, while the Bronx has a lower cost of living.

In [24]:
from scipy.stats import norm

plt.figure(figsize=(10,10))
sns.distplot(airbnb['price'], fit=norm)
plt.title("Price Distribution Plot",size=15, weight='bold')
Out[24]:
Text(0.5, 1.0, 'Price Distribution Plot')

The distribution graph above shows that price has a right-skewed distribution, i.e., positive skewness. A log transformation will be used to make this feature less skewed, which allows easier interpretation and better statistical analysis.

Since log(0) is undefined and some listings have a price of zero, a log(price + 1) transformation is preferable.
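A quick check (a sketch, not part of the original notebook) shows why the +1 offset matters: the summary statistics show a minimum price of 0, and the plain logarithm is undefined there.

```python
import numpy as np

# Sample nightly prices; the summary table shows the minimum listed price is 0.
prices = np.array([0.0, 69.0, 106.0, 175.0, 10000.0])

log_prices = np.log1p(prices)  # log(price + 1), well defined at price = 0
print(log_prices[0])           # 0.0, whereas np.log(0.0) would be -inf
```

`np.log1p` is equivalent to `np.log(prices + 1)` but is also more accurate for values near zero.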

In [25]:
airbnb['price_log'] = np.log(airbnb.price+1)

With the help of the log transformation, the price feature now has an approximately normal distribution.

In [26]:
plt.figure(figsize=(12,10))
sns.distplot(airbnb['price_log'], fit=norm)
plt.title("Log-Price Distribution Plot",size=15, weight='bold')
Out[26]:
Text(0.5, 1.0, 'Log-Price Distribution Plot')

In the graph below, the close fit of the points to the line indicates that normality is a reasonable approximation.

In [27]:
from scipy import stats

plt.figure(figsize=(7,7))
stats.probplot(airbnb['price_log'], plot=plt)
plt.show()
In [28]:
airbnb['neighbourhood_group']= airbnb['neighbourhood_group'].astype("category").cat.codes
airbnb['neighbourhood'] = airbnb['neighbourhood'].astype("category").cat.codes
airbnb['room_type'] = airbnb['room_type'].astype("category").cat.codes
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
name                              48895 non-null object
host_id                           48895 non-null int64
neighbourhood_group               48895 non-null int8
neighbourhood                     48895 non-null int16
latitude                          48895 non-null float64
longitude                         48895 non-null float64
room_type                         48895 non-null int8
price                             48895 non-null int64
minimum_nights                    48895 non-null int64
number_of_reviews                 48895 non-null int64
last_review                       48895 non-null datetime64[ns]
reviews_per_month                 48895 non-null float64
calculated_host_listings_count    48895 non-null int64
availability_365                  48895 non-null int64
adjusted_price                    48895 non-null float64
price_log                         48895 non-null float64
dtypes: datetime64[ns](1), float64(5), int16(1), int64(6), int8(2), object(1)
memory usage: 5.0+ MB
In [29]:
airbnb_model = airbnb.drop(columns=['name','host_id', 
                                   'last_review','price','adjusted_price'])

plt.figure(figsize=(15,12))
palette = sns.diverging_palette(20, 220, n=256)
corr=airbnb_model.corr(method='pearson')
sns.heatmap(corr, annot=True, fmt=".2f", cmap=palette, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}).set(ylim=(11, 0))
plt.title("Correlation Matrix",size=15, weight='bold')
Out[29]:
Text(0.5, 1, 'Correlation Matrix')

The correlation matrix shows that there is no strong relationship between price and the other features, which indicates that no feature needs to be taken out of the data.

Statistics and Machine Learning

Residual Plots

A residual plot is a strong method for detecting outliers and non-linearity, and for assessing data for regression models. The charts below show the residual plot of each feature against price.

In an ideal residual plot the red line would be horizontal. Based on the charts below, most features are non-linear, while there are not many outliers in any feature. This result leads to underfitting. Underfitting can occur when the input features do not have a strong relationship to the target variable or when the model is over-regularized. To avoid underfitting, new data features can be added or the regularization weight can be reduced.

In this kernel, since the input feature data cannot be increased, Regularized Linear Models will be used for regularization and a polynomial transformation will be applied to avoid underfitting.
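As a sketch of what the polynomial transformation does (toy values, not the notebook's exact pipeline), a degree-2 expansion of two features adds their squares and pairwise interaction, giving the linear models non-linear terms to fit:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy feature columns (e.g. latitude and minimum_nights), 3 samples.
X = np.array([[40.65, 1.0],
              [40.75, 3.0],
              [40.69, 10.0]])

# Degree-2 expansion yields [1, x1, x2, x1^2, x1*x2, x2^2] per row.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(X_poly.shape)  # (3, 6)
```

With all 10 model features, the same call produces 66 columns (1 bias, 10 linear, 55 quadratic terms), which is why regularization is paired with the expansion.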

In [30]:
airbnb_model_x, airbnb_model_y = airbnb_model.iloc[:,:-1], airbnb_model.iloc[:,-1]
In [31]:
f, axes = plt.subplots(5, 2, figsize=(15, 20))
# One residual plot per feature column, laid out on the 5x2 grid.
for i, ax in enumerate(axes.flatten()):
    sns.residplot(airbnb_model_x.iloc[:, i], airbnb_model_y, lowess=True, ax=ax,
                  scatter_kws={'alpha': 0.5},
                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.setp(axes, yticks=[])
plt.tight_layout()
/Users/saket/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py:3405: RuntimeWarning:

Invalid value encountered in median

/Users/saket/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/smoothers_lowess.py:165: RuntimeWarning:

invalid value encountered in greater_equal

Multicollinearity

Multicollinearity measures the relationship between the explanatory variables in a multiple regression. If multicollinearity occurs, the highly correlated input variables should be eliminated from the model.

In this kernel, multicollinearity will be checked using the eigenvalues of the correlation matrix.
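A toy example (a hypothetical 3x3 correlation matrix, not the notebook's data) shows how a near-zero eigenvalue signals multicollinearity:

```python
import numpy as np

# Hypothetical correlation matrix: features 1 and 2 are almost perfectly
# correlated (0.99), so the matrix is nearly rank-deficient.
corr_toy = np.array([[1.00, 0.99, 0.10],
                     [0.99, 1.00, 0.10],
                     [0.10, 0.10, 1.00]])

eigenvalues, V = np.linalg.eig(corr_toy)
print(np.sort(eigenvalues))  # the smallest eigenvalue is near 0 -> multicollinearity
```

By contrast, all eigenvalues in the notebook's result lie between roughly 0.33 and 1.95, well away from zero.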

In [32]:
multicollinearity, V=np.linalg.eig(corr)
multicollinearity
Out[32]:
array([1.94766095, 1.64337523, 1.41516454, 1.26383356, 0.32595472,
       0.46300457, 0.66853039, 0.70054096, 0.76213034, 0.93539909,
       0.87440567])

None of the eigenvalues of the correlation matrix is close to zero, which means that no multicollinearity exists in the data.

First, the Standard Scaler technique will be used to normalize the data set, so that each feature has zero mean and unit standard deviation.
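As a quick sanity check of what the scaler does (a sketch on synthetic data, not the notebook's features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic columns with roughly price-like and nights-like scales.
X = rng.normal(loc=[150.0, 7.0], scale=[240.0, 20.0], size=(1000, 2))

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```

StandardScaler subtracts each column's mean and divides by its (population) standard deviation, so columns on very different scales become directly comparable for the regularized models.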

In [33]:
scaler = StandardScaler()
airbnb_model_x = scaler.fit_transform(airbnb_model_x)

Secondly, the data will be split into training and test sets in a 70-30 ratio.

In [34]:
X_train, X_test, y_train, y_test = train_test_split(airbnb_model_x, airbnb_model_y, test_size=0.3,random_state=42)

Now it is time to build a feature importance graph. For this, the Extra Trees Classifier method will be used.

In [35]:
lab_enc = preprocessing.LabelEncoder()

feature_model = ExtraTreesClassifier(n_estimators=50)
feature_model.fit(X_train,lab_enc.fit_transform(y_train))

plt.figure(figsize=(7,7))
feat_importances = pd.Series(feature_model.feature_importances_, index=airbnb_model.iloc[:,:-1].columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

The graph above shows the feature importance for the dataset. According to it, neighbourhood group and room type have the lowest importance in the model. Based on this result, model building will be done in two phases: in the first phase, models will be built with all features, and in the second phase, without the neighbourhood group and room type features.
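Phase 2 then amounts to dropping those two columns before re-fitting. A minimal sketch (using a small stand-in DataFrame with a subset of the modelling columns):

```python
import pandas as pd

# Stand-in frame with a few of the modelling columns.
df = pd.DataFrame({
    "neighbourhood_group": [1, 0],
    "neighbourhood": [108, 127],
    "room_type": [1, 0],
    "minimum_nights": [1, 3],
    "price_log": [5.0, 5.4],
})

# Phase 2: remove the two least-important features identified above.
phase2 = df.drop(columns=["neighbourhood_group", "room_type"])
print(list(phase2.columns))  # ['neighbourhood', 'minimum_nights', 'price_log']
```

`drop(columns=...)` returns a new frame, so the full-feature data stays available for the phase-1 models.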

Model Building

Phase 1 - With All Features

The correlation matrix, residual plots, and multicollinearity results show that the model underfits and that there is no multicollinearity among the independent variables. Underfitting will be addressed with a polynomial transformation, since no new features can be added or substituted for the existing ones.

In the model building section, Linear Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression models will be built. The regularized models allow us to go beyond plain Linear Regression and show the effect of adding a little regularization.

First, the GridSearchCV algorithm will be used to find the best parameters and tune the hyperparameters of each model. The search will use 5-fold cross-validation with the mean squared error regression loss metric.

In [36]:
def linear_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_LR= LinearRegression()

    parameters = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True, False]}

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_LR = GridSearchCV(estimator=model_LR,  
                         param_grid=parameters,
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_LR.fit(input_x, input_y)
    best_parameters_LR = grid_search_LR.best_params_  
    best_score_LR = grid_search_LR.best_score_ 
    print(best_parameters_LR)
    print(best_score_LR)
In [37]:
def ridge_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_Ridge= Ridge()

    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    normalizes= ([True,False])

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_Ridge = GridSearchCV(estimator=model_Ridge,  
                         param_grid=(dict(alpha=alphas, normalize= normalizes)),
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_Ridge.fit(input_x, input_y)
    best_parameters_Ridge = grid_search_Ridge.best_params_  
    best_score_Ridge = grid_search_Ridge.best_score_ 
    print(best_parameters_Ridge)
    print(best_score_Ridge)
In [38]:
def lasso_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_Lasso= Lasso()

    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    normalizes= ([True,False])

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_lasso = GridSearchCV(estimator=model_Lasso,  
                         param_grid=(dict(alpha=alphas, normalize= normalizes)),
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_lasso.fit(input_x, input_y)
    best_parameters_lasso = grid_search_lasso.best_params_  
    best_score_lasso = grid_search_lasso.best_score_ 
    print(best_parameters_lasso)
    print(best_score_lasso)
In [39]:
def elastic_reg(input_x, input_y,cv=5):
    ## Defining parameters
    model_grid_Elastic= ElasticNet()

    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    normalizes= ([True,False])

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_elastic = GridSearchCV(estimator=model_grid_Elastic,  
                         param_grid=(dict(alpha=alphas, normalize= normalizes)),
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_elastic.fit(input_x, input_y)
    best_parameters_elastic = grid_search_elastic.best_params_  
    best_score_elastic = grid_search_elastic.best_score_ 
    print(best_parameters_elastic)
    print(best_score_elastic)
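The four helpers above all follow the same GridSearchCV pattern. A self-contained sketch of that pattern on synthetic data (tuning only `alpha`, since the `normalize` option used above has been removed from recent scikit-learn releases):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the Airbnb features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

grid = GridSearchCV(
    estimator=Ridge(),
    param_grid={'alpha': [1, 0.1, 0.01, 0.001, 0.0001]},
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)  # best alpha found on the grid
print(grid.best_score_)   # cross-validated negated mean squared error
```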

K-Fold Cross Validation

Before model building, 5-Fold Cross Validation will be implemented for validation.

In [40]:
kfold_cv=KFold(n_splits=5, shuffle=False)  # random_state omitted: it has no effect without shuffling
for train_index, test_index in kfold_cv.split(airbnb_model_x,airbnb_model_y):
    X_train, X_test = airbnb_model_x[train_index], airbnb_model_x[test_index]
    y_train, y_test = airbnb_model_y[train_index], airbnb_model_y[test_index]

Polynomial Transformation

The polynomial transformation will be made with degree two and interaction_only=True, which appends the pairwise products of the features (but not their squares).

In [41]:
Poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train = Poly.fit_transform(X_train)
X_test = Poly.transform(X_test)  # reuse the transformer fitted on the training data
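A quick illustration of what interaction_only=True produces, on toy values rather than the Airbnb features: each original column is kept and the pairwise products are appended, but squared terms are not.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 5.0]])
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

# Columns are x0, x1, x2, x0*x1, x0*x2, x1*x2 -> 2, 3, 5, 6, 10, 15
print(X_poly)
```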

Model Prediction

In [42]:
##Linear Regression
lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(X_train, y_train)
lr_pred= lr.predict(X_test)


#Ridge Model
ridge_model = Ridge(alpha = 0.01, normalize = True)
ridge_model.fit(X_train, y_train)             
pred_ridge = ridge_model.predict(X_test) 

#Lasso Model
Lasso_model = Lasso(alpha = 0.001, normalize =False)
Lasso_model.fit(X_train, y_train)
pred_Lasso = Lasso_model.predict(X_test) 

#ElasticNet Model
model_enet = ElasticNet(alpha = 0.01, normalize=False)
model_enet.fit(X_train, y_train) 
pred_test_enet= model_enet.predict(X_test)

Phase 2 - Without Neighbourhood Group and Room Type

All steps from Phase 1 will be repeated in this phase. The difference is that the neighbourhood_group and room_type features are eliminated.

In [43]:
airbnb_model_xx = airbnb_model.drop(columns=['neighbourhood_group', 'room_type'])
In [44]:
airbnb_model_xx, airbnb_model_yx = airbnb_model_xx.iloc[:,:-1], airbnb_model_xx.iloc[:,-1]
X_train_x, X_test_x, y_train_x, y_test_x = train_test_split(airbnb_model_xx, airbnb_model_yx, test_size=0.3,random_state=42)
In [45]:
scaler = StandardScaler()
airbnb_model_xx = scaler.fit_transform(airbnb_model_xx)
In [46]:
kfold_cv=KFold(n_splits=4, shuffle=False)  # random_state omitted: it has no effect without shuffling
for train_index, test_index in kfold_cv.split(airbnb_model_xx,airbnb_model_yx):
    X_train_x, X_test_x = airbnb_model_xx[train_index], airbnb_model_xx[test_index]
    y_train_x, y_test_x = airbnb_model_yx[train_index], airbnb_model_yx[test_index]
In [47]:
Poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_x = Poly.fit_transform(X_train_x)
X_test_x = Poly.transform(X_test_x)  # reuse the transformer fitted on the training data
In [48]:
###Linear Regression
lr_x = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr_x.fit(X_train_x, y_train_x)
lr_pred_x= lr_x.predict(X_test_x)

###Ridge
ridge_x = Ridge(alpha = 0.01, normalize = True)
ridge_x.fit(X_train_x, y_train_x)           
pred_ridge_x = ridge_x.predict(X_test_x) 

###Lasso
Lasso_x = Lasso(alpha = 0.001, normalize =False)
Lasso_x.fit(X_train_x, y_train_x)
pred_Lasso_x = Lasso_x.predict(X_test_x) 

##ElasticNet
model_enet_x = ElasticNet(alpha = 0.01, normalize=False)
model_enet_x.fit(X_train_x, y_train_x) 
pred_train_enet_x= model_enet_x.predict(X_train_x)
pred_test_enet_x= model_enet_x.predict(X_test_x)

4. Model Comparison

In this part, three metrics will be calculated to evaluate the predictions.

  • Mean Absolute Error (MAE) is the average absolute difference between predictions and actual values.

  • Root Mean Square Error (RMSE) shows how large an error the model typically makes, with larger errors penalized more heavily.

  • R^2 (the coefficient of determination) will be calculated as a goodness-of-fit measure: the proportion of variance in the target explained by the model.
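These definitions can be written out directly and checked against scikit-learn on toy values (a minimal sketch, not the Airbnb predictions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))           # average absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # typical error magnitude
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # fraction of variance explained

assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```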
In [49]:
print('-------------Linear Regression-----------')

print('--Phase-1--')
print('MAE: %f'% mean_absolute_error(y_test, lr_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test, lr_pred)))
print('R2 %f' % r2_score(y_test, lr_pred))

print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(y_test_x, lr_pred_x))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test_x, lr_pred_x)))
print('R2 %f' % r2_score(y_test_x, lr_pred_x))

print('---------------Ridge ---------------------')

print('--Phase-1--')
print('MAE: %f'% mean_absolute_error(y_test, pred_ridge))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test, pred_ridge)))
print('R2 %f' % r2_score(y_test, pred_ridge))

print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(y_test_x, pred_ridge_x))
print('RMSE: %f'% np.sqrt(mean_squared_error(y_test_x, pred_ridge_x)))
print('R2 %f' % r2_score(y_test_x, pred_ridge_x))

print('---------------Lasso-----------------------')

print('--Phase-1--')
print('MAE: %f' % mean_absolute_error(y_test, pred_Lasso))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test, pred_Lasso)))
print('R2 %f' % r2_score(y_test, pred_Lasso))

print('--Phase-2--')
print('MAE: %f' % mean_absolute_error(y_test_x, pred_Lasso_x))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test_x, pred_Lasso_x)))
print('R2 %f' % r2_score(y_test_x, pred_Lasso_x))

print('---------------ElasticNet-------------------')

print('--Phase-1 --')
print('MAE: %f' % mean_absolute_error(y_test, pred_test_enet))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test, pred_test_enet)))
print('R2 %f' % r2_score(y_test, pred_test_enet))

print('--Phase-2--')
print('MAE: %f' % mean_absolute_error(y_test_x, pred_test_enet_x))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test_x, pred_test_enet_x)))
print('R2 %f' % r2_score(y_test_x, pred_test_enet_x))
    
-------------Linear Regression-----------
--Phase-1--
MAE: 0.377923
RMSE: 0.522021
R2 0.527663
--Phase-2--
MAE: 0.531963
RMSE: 0.685894
R2 0.184227
---------------Ridge ---------------------
--Phase-1--
MAE: 0.377915
RMSE: 0.522038
R2 0.527631
--Phase-2--
MAE: 0.529255
RMSE: 0.679340
R2 0.199742
---------------Lasso-----------------------
--Phase-1--
MAE: 0.375922
RMSE: 0.520400
R2 0.530591
--Phase-2--
MAE: 0.523562
RMSE: 0.671290
R2 0.218595
---------------ElasticNet-------------------
--Phase-1 --
MAE: 0.371707
RMSE: 0.518862
R2 0.533362
--Phase-2--
MAE: 0.524883
RMSE: 0.670878
R2 0.219553

The results show that all models produce similar predictions within each phase, but there is a large difference between Phase 1 and Phase 2 on every metric. All error metrics increase in Phase 2, meaning the prediction error is higher and the models explain far less of the variability of the response around its mean.

  • An MAE value of 0 indicates no error in the model, in other words a perfect prediction. The above results show that all predictions carry substantial error, especially in Phase 2.
  • RMSE gives an idea of how much error the system typically makes in its predictions. The above results show that all models in both phases have significant errors.
  • R^2 represents the proportion of the variance in the dependent variable that is explained by the independent variables. The above results show that the models explain about 52–53% of the variance in Phase 1, but only about 18–22% in Phase 2.
Conclusion:

Summarizing our findings

This Airbnb ('AB_NYC_2019') dataset for the year 2019 proved to be a very rich dataset, with a variety of columns that allowed deep exploration of each significant column.
By creating a map showing the adjusted price of every listing, we saw how prices were distributed across New York. We also examined how listings are distributed by borough, the number of listings in each borough, and how prices were distributed within each borough.
From this, we obtained the mean listing price for each borough, which helps customers avoid overpaying in a specific area or being misled by hosts.
Models were fitted to predict price from all features, and from all features except neighbourhood_group and room_type. Price depends on all of the features, since the Phase 1 errors were lower than those of Phase 2.
Finally, ElasticNet was the best of the models tested for predicting price.

Future Work

For our data exploration purposes, it would also be useful to have a couple of additional features, such as positive and negative numeric (0-5 stars) reviews or a 0-5 star average review for each listing; together with the provided 'number_of_review' column, these would help identify the best-reviewed hosts in NYC.
To find a better prediction model for price, we could make use of AdaBoost, XGBoost, or a Random Forest regressor. And if customers provided ratings for every listing, a recommendation system could be built on top of them to help users find the best listings for their needs.
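As a minimal sketch of one such direction, a RandomForestRegressor baseline on synthetic data (illustrative only, not the Airbnb dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with 6 features, standing in for the listing data.
X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
r2 = r2_score(y_te, rf.predict(X_te))
print('R2: %f' % r2)
```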